Image and Text Feature Based Multimodal Learning for Multi-Label
Classification of Radiology Images in Biomedical Literature
Md. Rakibul Hasan (https://orcid.org/0000-0002-6179-2238), Md Rafsan Jani (https://orcid.org/0000-0001-7304-087X) and Md Mahmudur Rahman (https://orcid.org/0000-0003-0405-9088)
Computer Science Department, Morgan State University, Baltimore, Maryland, U.S.A.
Keywords:
Biomedical Image Annotation, Image Retrieval, Multimodal Learning, ResNet50, ViT, CNN, DistilGPT2.
Abstract:
Biomedical images are crucial for diagnosing and planning treatments, as well as advancing scientific under-
standing of various ailments. To effectively highlight regions of interest (RoIs) and convey medical concepts,
annotation markers like arrows, letters, or symbols are employed. However, annotating these images with
appropriate medical labels poses a significant challenge. In this study, we propose a framework that leverages
multimodal input features, including text/label features and visual features, to facilitate accurate annotation
of biomedical images with multiple labels. Our approach integrates state-of-the-art models such as ResNet50
and Vision Transformers (ViT) to extract informative features from the images. Additionally, we employ DistilGPT2, a distilled Generative Pre-trained Transformer (a Transformer-based natural language processing architecture), to extract textual features, leveraging its natural language understanding capabilities. This combination of image and
text modalities allows for a more comprehensive representation of the biomedical data, leading to improved
annotation accuracy. By combining the features extracted from both image and text modalities, we trained a
simplified Convolutional Neural Network (CNN) based multi-classifier to learn the image-text relations and
predict multi-labels for multi-modal radiology images. We used the ImageCLEFmedical 2022 and 2023 datasets to demonstrate the effectiveness of our framework. These datasets contain a diverse range of biomedical images, enabling evaluation of the framework's performance under realistic conditions. We achieved promising results with an F1 score of 0.508. Our proposed framework shows promise for annotating biomedical images with multiple labels, contributing to improved image understanding and analysis in the medical image processing domain.
1 INTRODUCTION
The advent of digital technology in the biomedical
field has led to an exponential increase in the vol-
ume of available radiology images and associated tex-
tual data within biomedical literature. This wealth
of information, while invaluable, presents significant
challenges in terms of efficient classification and re-
trieval (Dhawan, 2011). As a result, developing tools for annotating and classifying medical images to assist users, such as patients, researchers, general practitioners, and clinicians, in finding pertinent and helpful information is considered an active research domain in the biomedical sector (Demner-Fushman et al., 2009). This paper addresses this challenge by exploring the integration of both image and text features in the multi-label classification process.
Radiology images, such as X-rays, Computed
Tomography (CT) scans, and Magnetic Resonance
Imaging (MRI), are a cornerstone of medical diag-
nostics and research, offering vital insights into var-
ious medical conditions (Azam et al., 2022). How-
ever, the sheer volume and complexity of these im-
ages, coupled with the accompanying textual descrip-
tions, necessitate advanced methods for effective or-
ganization and retrieval (Rahman et al., 2015). Tra-
ditional approaches often rely heavily on either text-
based or image-based features, neglecting the poten-
tial synergy of a multimodal approach (Ritter et al.,
2011). In contrast, this paper explores multimodal learning by exploiting different state-of-the-art deep learning frameworks to leverage both image and text features for the multi-label classification
of radiology images in biomedical literature. By in-
tegrating visual cues from the images with contextual
information derived from textual data, our approach
aims to enhance the accuracy and efficiency of classi-
fication tasks. This is particularly crucial in the con-
text of biomedical literature, where the precise cate-
gorization of images is essential for aiding research,
clinical decision-making, and educational purposes.
To achieve the aim of this research paper, we uti-
lized the medical image dataset obtained from the
ImageCLEFmedical Caption Tasks of 2022 (Ionescu et al., 2022) and 2023 (Ionescu et al., 2023), (Barrón-Cedeño et al., 2023). The 2022 dataset comprises 83,275 training images and 7,645 validation images. The 2023 dataset includes 60,918 training images and 10,437 validation images. The medical images in these datasets are multi-modal, including X-ray, CT scan, MRI, ultrasound, PET scan, angiogram, and other types of radiology images. Each image in the training and validation sets is associated with captions and concepts. The medical concepts are presented in the UMLS format, and 2,125 Concept Unique Identifiers (CUIs) are employed to represent the UMLS terminologies, resulting in each image having multiple CUIs, or labels. The official test dataset is not used or reported here because it contains only the images; the corresponding concepts and captions were kept hidden for competition purposes. It is important to note that both of these datasets are updated and extended versions of the ROCO (Radiology Objects in Context) image dataset collected from various open access journals in PubMed (Pelka et al., 2018).
For our research, we used the 2023 CLEF training and validation images for training and testing, respectively. As a result, 60,918 images are used to train our model and 10,437 images are used to test the trained model. In addition, 16,358 images selected from the training and validation sets of the 2022 dataset are used for validation in this research. Images from the 2022 dataset having CUIs not used in 2023 are excluded from our validation set, so the total number of unique labels is kept at 2,125. Moreover, the possibility of the same images being reused across the training, validation, and test datasets is minimized. In total, 86,993 images are used in our research, with the training, validation, and test image-label pairs accounting for approximately 70%, 20%, and 10%, respectively.
The three most frequent CUIs in the training and validation datasets are 'C0040405' (frequency: 24,695), 'C1306645' (frequency: 19,833), and 'C0024485' (frequency: 11,554); the corresponding UMLS terms are 'Magnetic Resonance Imaging', 'Anterior-Posterior', and 'Angiogram'. Each image has, on average, five labels (CUIs), and the maximum number of labels found for a single image in the dataset is fifty. In addition, the set of image captions comprises a corpus of 23,237 words. The maximum number of words in a caption is 316; however, 99% of the images have captions of 90 words or fewer. Along with the respective multi-labels represented as CUIs/UMLS terms and the original captions, we have prepared a processed caption for each image based on its UMLS terms. For example, if an image has the three UMLS terms mentioned above, its processed caption is prepared by placing "This image shows" at the beginning, followed by the UMLS terms in sequence. It is worth mentioning that in the original dataset the UMLS terms are keywords derived from the captions; as a result, these processed captions work better than pre-processing the original captions with standard natural language processing techniques such as removing stop words, special characters, and numeric values, and converting to lower case.
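As a concrete illustration of this processed-caption construction, the following is a minimal sketch, assuming a hypothetical cui_to_umls dictionary that maps each CUI to its UMLS term; it is not the exact script used in our pipeline.

# Minimal sketch: build a processed caption from an image's CUIs.
# `cui_to_umls` is a hypothetical mapping from CUI codes to UMLS terms.
cui_to_umls = {
    "C0024485": "Magnetic Resonance Imaging",
    "C0449900": "Contrast used",
    "C0006104": "Brain",
    "C0014008": "Empty Sella Syndrome",
}

def build_processed_caption(cuis):
    """Prefix 'This image shows' and append the UMLS terms in sequence, lower-cased."""
    terms = [cui_to_umls[c] for c in cuis if c in cui_to_umls]
    return ("This image shows " + " ".join(terms)).lower()

print(build_processed_caption(["C0024485", "C0449900", "C0006104", "C0014008"]))
# -> 'this image shows magnetic resonance imaging contrast used brain empty sella syndrome'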
Figure 1 shows an instance from the dataset we
have used in our study.
Figure 1: A sample MRI image from our test dataset
(CC BY-NC [Murvelashvili et al. (2021)]). The corre-
sponding CUIs: [‘C0024485’, ‘C0449900’, ‘C0006104’,
‘C0014008’], UMLS: [‘Magnetic Resonance Imaging’,
‘Contrast used’, ‘Brain’, ‘Empty Sella Syndrome’], Pro-
cessed Caption: ‘this image shows magnetic resonance
imaging contrast used brain empty sella syndrome’, and
Original Caption: ‘Contrast-enhanced T1-weighted sagittal
image of the brain 1 month after initial presentation. The
arrow shows a mostly empty sella.’
2 RELATED WORKS
Multi-label classification of medical images is a task
that involves assigning multiple labels or categories to
an image, allowing for a more comprehensive and nu-
anced representation of the content within the image.
This task is particularly relevant in the biomedical
field, where images often contain multiple character-
istics, findings, or abnormalities that need to be accu-
rately identified and labeled. The objective of multi-
label classification is to develop a model that can predict the pertinent labels for a given input, which could
be radiology reports, images, or any other kind of data
(Zhang and Zhou, 2013). Each instance in the labeled
dataset used to train the model has a set of labels at-
tached to it.
The recent progress relevant to the task of multi-label classification mainly revolves around two types
of deep learning models based on Convolutional Neu-
ral Network (CNN) approach and Transformer ap-
proach. Several state-of-the-art (SOTA) CNN archi-
tectures have been employed for multi-label classifi-
cation of medical images. To tackle the challenges as-
sociated with multi-label classification, various prac-
tices and methodologies have been adopted. One
approach involves utilizing pre-trained CNN mod-
els, trained on large-scale generic image datasets, and
fine-tuning them on specific medical image datasets
(Tajbakhsh et al., 2016). These include techniques such as applying ensemble methods, transfer learning, and using models/weights pre-trained on comparatively larger image sets. Transfer learning, the most prominent of these, involves transferring knowledge learned from one domain to another and has been effective in improving the performance of models when training data is limited. Standard CNN archi-
tectures such as ResNet (He et al., 2016), Inception-
Net (Szegedy et al., 2016), EfficientNet (Tan and Le,
2019), VGGNet (Simonyan and Zisserman, 2014),
and DenseNet (Huang et al., 2017) are widely used for
this task due to their ability to capture intricate visual
features and patterns from images. Moreover, each of these architectures comes with unique features that demonstrate its particular strengths in image analysis. In the workshop notes of (Hasan et al., 2023) and (Yeshwanth et al., 2023), participants of the ImageCLEF 2023 medical tasks demonstrated the use of DenseNet121 for concept detection, or multi-label classification, on the CLEF 2023 dataset. Furthermore, in (Kaliosis et al., 2023) and (Shinoda et al., 2023), EfficientNetV2 and the fusion of EfficientNet with DenseNet were utilized for multi-label classification. VGG16 and a customized CNN architecture named ConceptNet were employed by (Rio-Torto et al., 2023) and (Mohamed and Srinivasan, 2023), respectively.
Overall, the core idea of Convolution is excellent
for tasks where local spatial or temporal relationships
are key, and its efficiency and simplicity make it a
staple in many applications. Attention, on the other
hand, shines in scenarios where the model needs to
dynamically focus on different parts of the input, cap-
turing long-range interactions and dependencies ef-
fectively (He et al., 2016), (Xu et al., 2015). More precisely, in the context of multi-modal medical image analysis, attention-based Transformer architectures have been thriving in recent years at extracting image features and establishing long-range dependencies between the modalities (Dai et al., 2021). Vision Trans-
former (ViT) (Dosovitskiy et al., 2020), Swin Trans-
former (Liu et al., 2021), and Data-efficient Image
Transformer (DeIT) (Touvron et al., 2021) are be-
ing widely used in medical image classification tasks
(Manzari et al., 2023), (Okolo et al., 2022).
3 MODEL IMPLEMENTATION
The visual representations that are extracted from im-
ages, generally utilizing methods like Convolution
and/or Transformer based architectures, are referred
to as image features. Effective image analysis and
classification are made possible by these features,
which capture significant visual patterns and char-
acteristics contained in the images (Liu and Deng,
2015). In contrast, text features represent text-based content such as image captions, radiological reports, or clinical notes. Deep learning architectures specialized in processing natural or scientific language can be used to extract text features, converting the text into numerical representations that convey important insights about the corresponding medical images beyond the image features alone (Pennington et al., 2014). As a result, the ap-
proach of incorporating textual information with vi-
sual features, such as radiology reports or image cap-
tions, alongside the images for multi-modal learning
is adopted in this research (Yao et al., 2019). By com-
bining both visual and textual features, models can
leverage the complementary information from both
modalities to improve classification accuracy. Multimodal learning mixes text and visual features to take advantage of complementary data from several modalities. Multimodal models seek to enhance the performance of tasks like image classification, object recognition, or image captioning by combining visual and linguistic characteristics. This strategy uses both visual and semantic cues to provide a more thorough grasp of the material (Frome et al., 2013).
The approach of connecting image and text features has recently been applied in the field of automated caption generation for images. For example, the CLIP (Contrastive Language-Image Pre-training) (Radford et al., 2021) model jointly embeds image and text features to establish image-to-text connections in a zero-shot manner. Another state-of-the-art model, BLIP (Bootstrapping Language-Image Pre-training) (Li et al., 2022), similarly aims at unifying vision and language features to generate captions.
In our implemented model, due to computational resource constraints, we exploit the predictive power of pre-trained SOTA CNN and Transformer-based models, followed by a simple CNN classifier. The pre-trained models used here are ResNet50, ViT32, and DistilGPT2.
ResNet (Residual Neural Network) (He et al.,
2016) is a groundbreaking neural network architec-
ture that has addressed the vanishing gradient prob-
lem of training very deep neural networks by intro-
ducing residual connections, enabling the training of
models with hundreds or even thousands of layers.
The residual block is the key building block of the
ResNet. Each residual block consists of a sequence
of convolutional layers, batch normalization, and rec-
tified linear unit (ReLU) activations. The innova-
tion lies in the introduction of skip connections, also
known as shortcut connections, which allow the net-
work to learn residual mappings (Yu et al., 2018). The
skip connections in ResNet enable the direct propa-
gation of information from one layer to subsequent
layers.
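To make the skip connection concrete, the following is a minimal Keras sketch of an identity residual block in the spirit of ResNet; the filter count and block layout are illustrative assumptions and do not reproduce ResNet50's bottleneck blocks exactly.

# Minimal sketch of an identity residual block (illustrative, not the exact ResNet50 block).
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # Assumes the input tensor already has `filters` channels so the addition is valid.
    shortcut = x                                   # skip (shortcut) connection
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                # add the learned residual to the input
    return layers.ReLU()(y)

inputs = layers.Input(shape=(56, 56, 64))
outputs = residual_block(inputs)                   # same shape: (56, 56, 64)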
Vision Transformer (ViT) extends the Trans-
former model, which was initially developed for nat-
ural language processing, to the field of computer vi-
sion. ViT analyzes an image as a sequence of flattened patches, rather than processing it on a pixel grid as traditional convolutional neural networks (CNNs) do.
Each of the fixed-size patches (Dosovitskiy et al.,
2020) that make up ViT’s input image is linearly pro-
jected to a lower-dimensional representation. Posi-
tional embeddings are provided to incorporate spa-
tial information, enabling the model to understand
the relative placements of patches. The patch em-
beddings are then passed via a Transformer encoder,
which consists of several feed-forward neural net-
works and self-attention layers. A crucial element
of the Transformer encoder (Touvron et al., 2021) is
self-attention, which enables the model to recognize
inter-dependencies and focus on various areas of the
image while processing. ViT learns complex local
and global context representations by simultaneously
monitoring all patches. The computation of attention
weights between pairs of patches is carried out by the
self-attention mechanism. Long-range dependencies
are now simpler to represent, and patches can now
communicate with one another. In order to conduct
image classification tasks with an accurate prediction
of the class labels, a classification head is added af-
ter the patch embeddings have been processed by the
Transformer encoder. ViT is frequently trained using
supervised learning on big datasets during the train-
ing phase, and the model parameters are optimized
via backpropagation and gradient-based optimization.
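As a small illustration of this patch-based view, the sketch below splits an image batch into fixed-size patches and linearly projects them into patch embeddings; the 32-pixel patch size and 768-dimensional embedding are assumed ViT-B/32-style settings, and the class token, positional embeddings, and the Transformer encoder itself are omitted.

# Minimal sketch: the patch-embedding step of ViT (class token, positional
# embeddings, and Transformer encoder blocks are omitted for brevity).
import tensorflow as tf
from tensorflow.keras import layers

patch_size, embed_dim = 32, 768                       # assumed ViT-B/32-style settings
images = tf.random.uniform((2, 224, 224, 3))          # stand-in for an image batch

patches = tf.image.extract_patches(
    images,
    sizes=[1, patch_size, patch_size, 1],
    strides=[1, patch_size, patch_size, 1],
    rates=[1, 1, 1, 1],
    padding="VALID",
)                                                                     # (2, 7, 7, 32*32*3)
patches = tf.reshape(patches, (2, -1, patch_size * patch_size * 3))   # (2, 49, 3072)
patch_embeddings = layers.Dense(embed_dim)(patches)                   # (2, 49, 768)
print(patch_embeddings.shape)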
A state-of-the-art language model that excels at
natural language processing (NLP) tasks is called
GPT (Generative Pre-trained Transformer). Trans-
former served as the foundation for GPT, which em-
ploys self-attention processes to identify connections
and complicated language patterns in textual data
(Radford et al., 2019). An important variation of
GPT2 is Distilled-GPT2 or DistilGPT2 developed by
Hugging Face (Sanh et al., 2019). DistilGPT-2 can be
used in similar applications as GPT-2, including text
generation, chatbots, content creation, and more. Its
smaller size makes it particularly useful in scenarios
where deploying a full GPT-2 model would be im-
practical due to resource constraints. Moreover, Dis-
tilGPT2 is believed to be more efficient in generat-
ing scientific texts. The self-attention (Vaswani et al.,
2017) method, which enables the model to concen-
trate on key portions of the input sequence while con-
structing each word, helps with this.
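As an illustration, the following is a minimal sketch of extracting a fixed-size text feature from a processed caption with DistilGPT2 via the Hugging Face transformers library; mean-pooling the final hidden states into a 768-dimensional vector is one simple choice and is an assumption rather than a description of our exact pooling.

# Minimal sketch: a 768-dimensional text feature from DistilGPT2.
# Mean pooling over the final hidden states is an illustrative choice.
import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
model = TFGPT2Model.from_pretrained("distilgpt2")

caption = "this image shows magnetic resonance imaging contrast used brain empty sella syndrome"
inputs = tokenizer(caption, return_tensors="tf")
hidden = model(inputs).last_hidden_state              # (1, seq_len, 768)
text_feature = tf.reduce_mean(hidden, axis=1)         # (1, 768)
print(text_feature.shape)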
Figure 2: The Implemented Model.
According to Figure 2, we have utilized the pre-trained ResNet50 and ViT32 architectures to extract the visual features of the images; each image yields a feature vector of size (768,) from each of these architectures. The text feature vector is extracted from the processed caption of each image using the pre-trained DistilGPT2 model, which produces an embedding of size (768,) for each caption. After obtaining the visual and textual features, a concatenation operation is applied to merge the vectors, generating a vector of size (2304,). Finally, a simplified CNN-based multi-classifier is fed the concatenated feature vectors to predict multi-labels, or concepts, for the images in the test dataset over 2,125 unique classes.
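The sketch below outlines this feature-extraction and concatenation stage. It is a minimal sketch under one explicit assumption: the 2048-dimensional pooled ResNet50 features from Keras are projected to 768 dimensions with a Dense layer so that the ResNet50, ViT32, and DistilGPT2 features all have size (768,); the paper does not spell out this projection, and the ViT and text features are shown as placeholders.

# Minimal sketch of the feature fusion stage; the Dense projection of the
# ResNet50 features to 768 dimensions is an assumption, and the ViT32 and
# DistilGPT2 features are random placeholders standing in for real extractions.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

resnet = ResNet50(weights="imagenet", include_top=False, pooling="avg")  # (None, 2048)
project = tf.keras.layers.Dense(768)

image = np.random.rand(1, 224, 224, 3) * 255.0                 # stand-in radiology image
resnet_feat = project(resnet(preprocess_input(image)))         # (1, 768)

vit_feat = np.random.rand(1, 768).astype("float32")            # placeholder ViT32 feature
text_feat = np.random.rand(1, 768).astype("float32")           # placeholder DistilGPT2 feature

fused = tf.keras.layers.Concatenate()([resnet_feat, vit_feat, text_feat])  # (1, 2304)
print(fused.shape)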
Figure 3 depicts the custom architecture of our simplified CNN multi-classifier. It begins with a Reshape layer that converts the size (2304,) input into size (48x48x1), as required by the following Conv2D layer; this is followed by a MaxPooling2D layer, a Flatten layer, and a Dense layer. Finally, a Dense layer with a 'sigmoid' activation function is used to predict the multi-labels from the set of 2,125 labels. Furthermore, the binary cross-entropy loss function is used to estimate the probability of each label for a test image.
Figure 3: The Simplified CNN based Multi-classification.
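A minimal Keras sketch of this simplified classifier is shown below; the paper specifies only the layer sequence (Reshape, Conv2D, MaxPooling2D, Flatten, Dense, and a final sigmoid Dense layer), so the filter count, kernel size, and hidden Dense width are assumptions.

# Minimal sketch of the simplified CNN multi-classifier; filter count, kernel
# size, and hidden Dense width are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

num_labels = 2125

classifier = keras.Sequential([
    layers.Input(shape=(2304,)),
    layers.Reshape((48, 48, 1)),                     # 2304 = 48 x 48
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(num_labels, activation="sigmoid"),  # independent per-label probabilities
])
classifier.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
classifier.summary()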
4 EXPERIMENTS & RESULTS
Figure 4 depicts the loss and accuracy both for the
training and validation phases. Here, loss represents
the discrepancy between the predicted labels and the
true labels. On the other hand, accuracy is a met-
ric that measures the proportion of correctly classified
samples out of the total number of samples.
Figure 4: Loss and Accuracy.

Table 1 shows the key performance metrics of the model in predicting multi-labels for the test dataset.

Table 1: Key Performance Metrics.
            Accuracy  Precision  Recall  F1-score
Training    0.329     -          -       -
Validation  0.326     -          -       -
Testing     -         0.697      0.447   0.508

Precision is the measure of the model's ability to correctly identify positive samples out of all the samples predicted as positive. Recall is the measure of
the model's ability to correctly identify positive samples out of all the actual positive samples. The F1 score combines precision and recall, providing a balanced evaluation of the model's performance that considers both false positives and false negatives. We have also conducted an investigation of predicting multi-labels without using text features, which results in lower performance than our proposed model using both visual and textual features. More precisely, the F1 score is approximately 0.31 for the model excluding text features. On the other hand, comparing our results with the CLEF 2023 challenge participants' results would be misleading, because the F1 scores of those participants were calculated by the organizers on the official test dataset, whose associated multi-labels were not disclosed.
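For reference, a common way to obtain such multi-label precision, recall, and F1 values is to threshold the sigmoid outputs and average the per-sample scores, as in the minimal sketch below; the 0.5 threshold and the 'samples' averaging mode are assumptions, not a description of the official CLEF evaluation.

# Minimal sketch: multi-label precision/recall/F1 from sigmoid outputs.
# The 0.5 threshold and 'samples' averaging are illustrative assumptions.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([[1, 0, 1, 0], [0, 1, 1, 0]])                   # ground-truth indicators
y_prob = np.array([[0.9, 0.2, 0.7, 0.1], [0.3, 0.8, 0.4, 0.2]])   # sigmoid outputs
y_pred = (y_prob >= 0.5).astype(int)

print(precision_score(y_true, y_pred, average="samples", zero_division=0))
print(recall_score(y_true, y_pred, average="samples", zero_division=0))
print(f1_score(y_true, y_pred, average="samples", zero_division=0))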
Figures 5-7 depict the predicted multi-labels for some of the test images used in our research, in comparison with the actual ground-truth labels.
Figure 5: Random Test Image (CC BY-NC [Ogamba et al.
(2021)]); Ground Truth CUIs: [‘C1306645’, ‘C0817096’,
‘C1999039’, ‘C0039985’]; Predicted CUIs: [‘C1306645’,
‘C1999039’, ‘C1306645’, ‘C0032285’].
Figure 6: Random Test Image (CC BY [Muacevic
et al. (2021)]); Ground Truth CUIs: [‘C0002978’,
‘C0002940’, ‘C0226156’, ‘C0582802’, ‘C0007276’]; Pre-
dicted CUIs: [‘C0040405’, ‘C0817096’, ‘C0002940’, ‘C0226156’, ‘C0582802’].

Figure 7: Random Test Image (CC BY [Ruiz et al. (2021)]); Ground Truth CUIs: [‘C0026264’, ‘C0225860’, ‘C0003483’]; Predicted CUIs: [‘C0026264’, ‘C0002978’, ‘C0456598’, ‘C0190010’, ‘C0003483’].
5 CONCLUSION
Valuable information is provided by harnessing the
often overlooked textual and visual content, going
beyond traditional databases. Future efforts include
generating more training data and building advanced
information retrieval systems with a fusion model.
However, the models used in the study face limitations in predicting with higher accuracy, even with well-defined image pre-processing techniques using Keras networks for multi-label classification. Future work aims to overcome this by focusing
on deep learning-based object detection. The impact
of this research is substantial for applications such
as digital libraries and image search engines, which
demand efficient techniques for image categorization
and access.
ACKNOWLEDGEMENTS
This work is supported by the National Science Foun-
dation (NSF) grant (ID: 2131307) under CISE-MSI
program.
REFERENCES
Azam, M. A., Khan, K. B., Salahuddin, S., Rehman, E.,
Khan, S. A., Khan, M. A., Kadry, S., and Gandomi,
A. H. (2022). A review on multimodal medical im-
age fusion: Compendious analysis of medical modal-
ities, multimodal databases, fusion techniques and
quality metrics. Computers in biology and medicine,
144:105253.
Barrón-Cedeño, A., Da San Martino, G., Esposti, M. D.,
Faggioli, G., Ferro, N., Hanbury, A., Macdonald, C.,
Pasi, G., Potthast, M., and Sebastiani, F. (2023). Re-
port on the 13th conference and labs of the evaluation
forum (clef 2022) experimental ir meets multilingual-
ity, multimodality, and interaction. In ACM SIGIR Fo-
rum, volume 56, pages 1–15. ACM New York, NY,
USA.
Dai, Y., Gao, Y., and Liu, F. (2021). Transmed: Transform-
ers advance multi-modal medical image classification.
Diagnostics, 11(8):1384.
Demner-Fushman, D., Antani, S., Simpson, M., and
Thoma, G. R. (2009). Annotation and retrieval of clin-
ically relevant images. international journal of medi-
cal informatics, 78(12):e59–e67.
Dhawan, A. P. (2011). Medical image analysis. John Wiley
& Sons.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., et al. (2020). An image is
worth 16x16 words: Transformers for image recogni-
tion at scale. arXiv preprint arXiv:2010.11929.
Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean,
J., Ranzato, M., and Mikolov, T. (2013). Devise: A
deep visual-semantic embedding model. Advances in
neural information processing systems, 26.
Hasan, M. R., Layode, O., and Rahman, M. (2023).
Concept detection and caption prediction in image-
clefmedical caption 2023 with convolutional neural
networks, vision and text-to-text transfer transform-
ers. In CLEF2023 Working Notes, CEUR Workshop
Proceedings, Thessaloniki, Greece. CEURWS.org.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger,
K. Q. (2017). Densely connected convolutional net-
works. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 4700–
4708.
Ionescu, B., Müller, H., Drăgulinescu, A. M., Popescu, A., Idrissi-Yaghir, A., García Seco de Herrera, A., Andrei, A., Stan, A., Storås, A. M., Abacha, A. B., et al.
(2023). Imageclef 2023 highlight: Multimedia re-
trieval in medical, social media and content recom-
mendation applications. In European Conference on
Information Retrieval, pages 557–567. Springer.
Ionescu, B., Müller, H., Péteri, R., Rückert, J., Abacha, A. B., de Herrera, A. G. S., Friedrich, C. M., Bloch, L., Brüngel, R., Idrissi-Yaghir, A., et al. (2022).
Overview of the imageclef 2022: Multimedia retrieval
in medical, social media and nature applications.
In International Conference of the Cross-Language
Evaluation Forum for European Languages, pages
541–564. Springer.
Kaliosis, P., Moschovis, G., Charalambakos, F., Pavlopou-
los, J., and Androutsopoulos, I. (2023). Aueb
nlp group at imageclefmedical caption 2023. In
CLEF2023 Working Notes, CEUR Workshop Pro-
ceedings, Thessaloniki, Greece. CEUR-WS.org.
Li, J., Li, D., Xiong, C., and Hoi, S. (2022). Blip:
Bootstrapping language-image pre-training for unified
vision-language understanding and generation. In In-
ternational Conference on Machine Learning, pages
12888–12900. PMLR.
Liu, S. and Deng, W. (2015). Very deep convolutional
neural network based image classification using small
training sample size. In 2015 3rd IAPR Asian confer-
ence on pattern recognition (ACPR), pages 730–734.
IEEE.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin,
S., and Guo, B. (2021). Swin transformer: Hierar-
chical vision transformer using shifted windows. In
Proceedings of the IEEE/CVF international confer-
ence on computer vision, pages 10012–10022.
Manzari, O. N., Ahmadabadi, H., Kashiani, H., Shokouhi,
S. B., and Ayatollahi, A. (2023). Medvit: a ro-
bust vision transformer for generalized medical image
classification. Computers in Biology and Medicine,
157:106791.
Mohamed, S. S. N. and Srinivasan, K. (2023). Ssn
mlrg at caption 2023: Automatic concept detection
and caption prediction using conceptnet and vision
transformer. In CLEF2023 Working Notes, CEUR
Workshop Proceedings, Thessaloniki, Greece. CEUR-
WS.org.
Okolo, G. I., Katsigiannis, S., and Ramzan, N. (2022). Ievit:
An enhanced vision transformer architecture for chest
x-ray image classification. Computer Methods and
Programs in Biomedicine, 226:107141.
Pelka, O., Koitka, S., Rückert, J., Nensa, F., and Friedrich,
C. M. (2018). Radiology objects in context (roco):
a multimodal image dataset. In Intravascular Imag-
ing and Computer Assisted Stenting and Large-Scale
Annotation of Biomedical Data and Expert Label
Synthesis: 7th Joint International Workshop, CVII-
STENT 2018 and Third International Workshop, LA-
BELS 2018, Held in Conjunction with MICCAI 2018,
Granada, Spain, September 16, 2018, Proceedings 3,
pages 180–189. Springer.
Pennington, J., Socher, R., and Manning, C. D. (2014).
Glove: Global vectors for word representation. In
Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP),
pages 1532–1543.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., et al. (2021). Learning transferable visual models
from natural language supervision. In International
conference on machine learning, pages 8748–8763.
PMLR.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D.,
Sutskever, I., et al. (2019). Language models are un-
supervised multitask learners. OpenAI blog, 1(8):9.
Rahman, M. M., Antani, S. K., Demner-Fushman, D., and
Thoma, G. R. (2015). Biomedical image representa-
tion approach using visualness and spatial information
in a concept feature space for interactive region-of-
interest-based retrieval. Journal of Medical Imaging,
2(4):046502–046502.
Rio-Torto, I., Patrício, C., Montenegro, H., Gonçalves, T.,
and Cardoso, J. S. (2023). Detecting concepts and
generating captions from medical images: Contribu-
tions of the vcmi team to imageclefmedical caption
2023. In CLEF2023 Working Notes, Thessaloniki,
Greece. CEUR-WS.org, CEUR Workshop Proceed-
ings.
Ritter, F., Boskamp, T., Homeyer, A., Laue, H., Schwier,
M., Link, F., and Peitgen, H.-O. (2011). Medical im-
age analysis. IEEE pulse, 2(6):60–70.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019).
Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter. In NeurIPS EMC² Workshop.
Shinoda, H., Aono, M., Asakawa, T., Shimizu, K., Ko-
moda, T., and Togawa, T. (2023). Kde lab at im-
ageclefmedical caption 2023. In CLEF2023 Working
Notes, CEUR Workshop Proceedings, Thessaloniki,
Greece. CEUR-WS.org.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wo-
jna, Z. (2016). Rethinking the inception architecture
for computer vision. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 2818–2826.
Tajbakhsh, N., Shin, J. Y., Gurudu, S. R., Hurst, R. T.,
Kendall, C. B., Gotway, M. B., and Liang, J. (2016).
Convolutional neural networks for medical image
analysis: Full training or fine tuning? IEEE trans-
actions on medical imaging, 35(5):1299–1312.
Tan, M. and Le, Q. (2019). Efficientnet: Rethinking model
scaling for convolutional neural networks. In Interna-
tional conference on machine learning, pages 6105–
6114. PMLR.
Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., and
Jégou, H. (2021). Going deeper with image transform-
ers. In Proceedings of the IEEE/CVF international
conference on computer vision, pages 32–42.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudi-
nov, R., Zemel, R., and Bengio, Y. (2015). Show, at-
tend and tell: Neural image caption generation with
visual attention. In International conference on ma-
chine learning, pages 2048–2057. PMLR.
Yao, J., Zhu, X., and Huang, J. (2019). Deep multi-
instance learning for survival prediction from whole
slide images. In Medical Image Computing and Com-
puter Assisted Intervention–MICCAI 2019: 22nd In-
ternational Conference, Shenzhen, China, October
13–17, 2019, Proceedings, Part I 22, pages 496–504.
Springer.
Yeshwanth, V., P, P., and Kalinathan, L. (2023). Concept de-
tection and image caption generation in medical imag-
ing. In CLEF2023 Working Notes, CEUR Workshop
Proceedings, Thessaloniki, Greece. CEUR-WS.org.
Yu, X., Yu, Z., and Ramalingam, S. (2018). Learning strict
identity mappings in deep residual networks. In Pro-
ceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 4432–4440.
Zhang, M.-L. and Zhou, Z.-H. (2013). A review on
multi-label learning algorithms. IEEE transactions on
knowledge and data engineering, 26(8):1819–1837.